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The inference of networks of dependencies by Gaussian Grapiiical models on high-throughput data is an open 
issue in modem molecular biology. In this paper we provide a comparative study of three methods to obtain 
small sample and high dimension estimates of partial correlation coefficients: the Moore-Penrose pseudoinverse 
(PINV), residual correlation (RCM) and covariance-regularized method (i'ac)- We first compare them on simu- 
lated datasets and we find that PINV is less stable in terms of AUG performance when the number of variables 
changes. The two regularized methods have comparable performances but lie i^ much faster than RGM. Finally, 

^^ ' we present the results of an application of (2c for the inference of a gene network for isoprenoid biosynthesis 

CN I pathways in Arabidopsis thaliana. 

INTRODUCTION 

One of the aims of systems biology is to provide quantitative models for the study of complex interaction patterns among genes 
| 7 | and their products that are the result of many biological processes in the cell, such as biochemical interactions and regulatory 

^) . activities. In this framework, graphical models yj] have been exploited as useful stochastic tools to investigate and describe 
^^ ' the conditional independence structure between random variables. In particular, the Graphical Gaussian Models (GGM) use 
O the partial coiTelation estimates as a measure of conditional independence between any two variables [2]. Unfortunately, the 
C^ application of GGMs classical theory is still a hard task. The genomic data are tipically characterized by a huge number of 
genes p with respect to the small number of available samples n. This makes unreliable the application of the classical GGMs 
theory to the small sample setting case. In recent years, several methods have been proposed to overcome this problem by 
reducing the numbers of genes or gene lists in order to reach the n> p regime yD. Other solutions have been also proposed 
||4|-|6t] to circumvent the problem of computing full partial correlation coefficients by using only zero and first order coefficients. 
However, these approaches do not take into account all multigene effects on each pair of variables. A more sophisticated way to 
adapt GGMs to the n < p case is to find regularized estimates for the covariance matrix ||7|-|9t] and its inverse. Once regularized 
psj estimates of partial correlation are available, heuristic searches can be used to find an optimal graphical model. A fundamental 

^^ assumption to perform these quantitative methods is the sparsity of biological networks: only a few edges are supposed to be 

r — present in the gene regulatory networks, so that reliable estimates of the graphical model can be inferred also in small sample 
f^ case |5]. A regularized GGM method based on a Stein-type shrinkage has been applied to genomic data 1 10] and the network 



selection has been based on false discovery rate multiple testing. In Ref. Illlll the same procedure to select the network has 
been adopted, with a Moore-Penrose pseudoinverse method to obtain the concentration matrix. Finally, the authors in Ref. il2ll 
have suggested an attractive and simple approach based on lasso-type regression to select among the partial correlations the 
nonzero values, paving the way to a number of analysis and novel algorithms based on lasso (.\ regularizations 171491 1 1 311 . In 
5, this work, we focus on regularized methods for the estimation of the concentration matrix in an undirected GGM. In particular, 

d we present a comparative study of three methods in terms of AUC (area under the Receiving Operative Characteristic curve) 
and timing performances. One is based on Moore-Penrose pseudoinverse (PINV), the other two provide an estimate of the 
partial correlation coefficients, based on Regularized Least Square regression (RCM) and a covariance-regularized method with 
a I2 penalty in the log-likelihood function (^2c)- Finally, we apply the I2C method to infer a gene network for the isoprenoid 
biosynthesis pathways in A. thaliana. This network structural analysis allows to enlight some expected pathway properties. In 
particular, we find a negative partial correlation coefficient between the two hubs in the two isoprenoid pathways. This suggests 
a different response of the pathways to the several tested experimental conditions and, together with the high connectivity of the 
two hubs, provides an evidence of cross-talk between genes in the plastidial and the cytosolic pathways. 

GAUSSIAN NETWORKS FROM MICROARRAY DATA 

Let X — [Xi ,... ,Xp) e M.P be a random vector distributed according a multivariate normal distribution ^(/x, S). The inter- 
action structure between these variables can be described by means of a graph G = {V,E), where V is the vertex set and E is the 
edge set. If vertices of V are identified with the random variables Xi,. . . ,Xp, then the edges of E can represent the conditional 
dependence between the vertices. In other words, the absence of an edge between the /— th and j— th vertex means a conditional 



independence between the associated variables Xj and Xj. In this study, we shall consider only undirected Gaussian graphs G 
with pairwise Markov property, such that for all {i,j) ^ E one has 

X,ALXj\Xy\[ij} ij = l,...,p, (1) 

i.e. Xi and Xj are conditionally independent being fixed all other variables Xy\ r, n. Since X follows a /?— variate normal 
distribution, the condition ([TJ turns out to be Pij.v\{i.j} = 0, where Pij-v\iLj} i^ '^he partial correlation coefficient between the 
i—th and j—th variable, being fixed all other variables. It has been shown UJ] that partial correlation matrix elements are related 
to the precision matrix (or inverse covariance matrix) = S^', as: 

9ij 

pu-v\{ij} = — 7ri= ' ^ ■'' ' ^^^ 

where 0,j are elements of $7. In general, when the number of observations n is greater than the number of variables p, it is 
straightforward to evaluate 9ij in Eq. (|2]i by inverting the sample covariance matrix. Unfortunately, a typical genomic dataset is 



characterized by n < p, so that the sample covariance matrix becomes not invertible 111411 . For this reason, in order to estimate the 
partial correlation matrix one needs alternative methods to overcome the problem, like regularization methods, ridge regression 
or pseudoinverse. 

Partial correlation matrix estimation 

In order to describe the three methods that we shall investigate, let us consider the n x p matrix X — (Xi ,X2, . . . ,Xp), where 
each {X,} e R", with n < p. Let us indicate S as the estimate of the covariance matrix S and as the estimate of inverse 
covariance matrix S^'. 

Pseudoinverse method (PINV) 

The precision matrix © can be obtained as pseudoinverse of S, by using the Singular Value Decomposition (SVD). Indeed, 
a singular value decomposition of a mx q matrix M, is M = L'^AV* , where f/ is a m x m unitary matrix, A is m x ^ diagonal 
matrix with nonnegative real numbers on the diagonal and V* is a ^ x ^ unitary matrix (transpose conjugate of V). Then, the 
pseudoinverse of M is M+ = VA+t/*, where A+ is obtained by replacing each diagonal element with its reciprocal and then 
transposing the matrix. 

Covariance-regularized method (i'ac) 

Let us consider a log likelihood function with a (.2 penalization ||9|] : 

L(0)=logdet0-Tr(S0)-A||0||^, (3) 

with A > and ||0|||^ = tr(0^0). The maximization of Eq. ^ with respect to is equivalent to solve the following equation 

0-'-2A0 = S. (4) 

Consequently, the problem turns out to be an eigenvalue problem, therefore the eigenvalues 0,- of can be evaluated as function 
of the eigenvalues s, of S: 



Since must be positive definite, the correct value of 0, is Q^ then, for the spectral theorem the precision matrix is given by 

Q>^Y.^+ninJ . (6) 

Finally, in order to estimate the parameter X that maximizes the penalized log-likelihood function in Eq. (O, we carry out 20 
random splits of the data set in training and validation sets and then we evaluate the log-likelihood over the validation set. 



n 


% 


PINV 


RCM 


AUG 


AUG std 


T(s) 


AUG 


AUG std 


T(s) 


AUG 


AUG std 


T(s) 


r 500 


0.998 


0.0001 


38.86 


0.987 


0.0006 


0.161 


0.999 


0.0001 


8343 


h 500 


1.000 


0.0000 


83.74 


0.999 


0.0000 


0.164 


1.000 


0.0000 


6468 


c 500 


0.995 


0.0002 


84.95 


0.963 


0.0014 


0.164 


0.996 


0.0002 


6449 


r 200 


0.976 


0.0003 


38.44 


0.581 


0.0161 


0.111 


0.984 


0.0006 


3566 


h 200 


1.000 


0.0000 


81.13 


0.806 


0.0150 


0.115 


0.999 


0.0001 


3555 


c 200 


0.936 


0.0008 


82.02 


0.587 


0.0049 


0.121 


0.923 


0.0009 


3747 


r 20 


0.808 


0.0011 


39.03 


0.929 


0.0018 


0.093 


0.924 


0.0017 


105 


h 20 


0.999 


0.0001 


82.03 


1.000 


0.0000 


0.091 


0.999 


0.0000 


106 


c 20 


0.668 


0.0014 


82.13 


0.659 


0.0014 


0.091 


0.659 


0.0014 


108 



TABLE I: AUG, AUG standard error and timing performances for p = 400. Left part: IL^c method. Center part: PINV. Right part: RGM. 
Indices r, h and c stand for random, hubs and clique pattern, respectively. 



Residual correlation method (RGM) 

We consider a regression model for the variables X, and Xy as 

X,- = (/3(,) , Xvv) + bi Xj = {I3^j) , X\,v) + bj 



(7) 



where {/3(,)} is the regression coefficient vector in p — 2 dimensions referred to the ;— th gene; X, is the /— th column of the 
matrix X and X\,\y is X without the /— th and y— th columns. The Regularized Least Square (RLS) 11511 method evaluates the 
regression models (|7|i by solving 



min -|1X,.-/3(,.)X\,\J2 + ^P 



peKP 



(8) 



-Xi andrj=Xj-Xj. 



Now, if X, and Xj are the RLS estimates of X; and X^, one can evaluate the residual vectors r; = X, 

This allows to evaluate the partial correlation coefficients Pij\p-2 between the /— th and /'— th variable being fixed all other p — 2 

variables as the Pearson correlation r^n between the residuals, i.e. 



Pij\p-2 - rr, 



cov(ri,rj) 



^var(ri) • var(ri) 
Finally, the A > parameter has been chosen by minimizing the Leave-One-Out cross validation errors. 



(9) 



COMPARATIVE STUDY OF ACCURACY 



Data generation 



Datasets with different numbers of variables and observations have been used in order to investigate the performances of 
the methods, i.e. p = {50,200,400} and n — {20,200,500}. Each dataset X has been generated from a multivariate gaussian 
distribution with zero mean and covariance Sth ~ ®7\} ■ The structure of the precision matrix 0th presents the following patterns 



II13II : random, hubs and cliques and it has approximately p non vanishing entries out of the p{p — l)/2 off-diagonal elements, 
except for clique configuration where the entries are approximately 2p. 

In the random pattern, the off-diagonal terms of ©th are set randomly to a fixed value 7^ 0. In the hubs configuration, we 
partition the columns into disjoint groups G^, where index k indicates the fe— th column chosen as "central" in each group. Then 
the off-diagonal terms are set 0,^ = if / G G^, otherwise dfk = 0. In the cliques pattern, the precision matrix is partitioned as 
done in hubs and the off-diagonal terms 0,y are set to Q if /, j G Gk, with / ^ j. The positive definiteness for each configuration, 
is guaranteed by the diagonal entries which are selected in order to keep ©th diagonally dominant. 



n 


^2C 


PINV 


RCM 


AUG 


AUG std 


T(s) 


AUG 


AUG std 


T(s) 


AUG 


AUG std 


T(s) 


r 500 


0.999 


0.0001 


5.807 


0.999 


0.0001 


0.0377 


0.999 


0.0001 


807 


h 500 


1.000 


0.0000 


10.655 


1.000 


0.0000 


0.0376 


1.000 


0.0000 


450 


c 500 


0.996 


0.0002 


10.821 


0.999 


0.0001 


0.0439 


0.999 


0.0000 


436 


r 200 


0.986 


0.0003 


5.592 


0.703 


0.0067 


0.0310 


0.990 


0.0007 


861 


h 200 


1.000 


0.0000 


10.425 


0.748 


0.0124 


0.0309 


0.999 


0.0003 


856 


c 200 


0.944 


0.0010 


10.529 


0.612 


0.0064 


0.0336 


0.950 


0.0008 


1028 


r 20 


0.784 


0.0016 


6.150 


0.880 


0.0048 


0.0187 


0.871 


0.0046 


24.5 


h 20 


0.999 


0.0001 


10.574 


0.999 


0.0002 


0.0182 


0.999 


0.0001 


27.9 


c 20 


0.669 


0.0016 


10.545 


0.649 


0.0017 


0.0189 


0.654 


0.0017 


25.3 



TABLE II: AUG, AUG standard error and timing performances for p = 200. Left part: l^c method. Center part: PINV. ./^ig/if part: RGM. 
Indices r, h and c stand for random, hubs and clique pattern, respectively. 

Performances 

In order to compare the performances of the three methods, we have used this procedure: (I) For each data generation pattern, 
draw a random dataset X from ^(0, Sth); (II) Evaluate S and ©exp in the case of PINV and t^c^ hence find pexp from Eq. (|2]i; 
in the case of RCM use Eq. ^ for the evaluation of Pexp; (HI) For each method, evaluate the AUC performance, as follows. 
Since the edges in our simulated dataset have the same strength and we know the label edge and non edge for each element, the 
elements of Pexp can be divided in two sets: Pexp for the edge elements and Pexp for the non edge ones. The AUC measures the 
performances of the three methods in terms of accuracy of classification of edge and non edges by using the relative Pexp values. 

RESULTS 

In TablesHllnlandlllllwe present the AUC, AUC standard error and timing (in seconds) performances for p = {400,200,50}, 
respectively. Each table is divided in three columns related to the analyzed methods. Indices r, /;, and c refer to the three data 
generation methods: random, hubs, and clique. The results shown are averaged over 20 trials for n = {500,200,20}. 

As expected, when n> p all methods provide the same efficiency with an AUC virtually equal to 1 . In fact, in this case the use 
of regularization methods should be not required. When p > n,we find that PINV presents some instability in AUC outcomes, 
mainly in those region when p w n. This can be due to a "resonance effect", as explained in Refs. llll[ll6l] . Instead, RCM and 
£2c show high value of AUC in all settings and have similar performances, almost indipendently of the range of p and n. Note 
that, only in the random configuration, when « == 20 and p — {200,400}, RCM shows AUC values 10% larger than £20 ones. 
On the other hand, the timing comparison highlights that i2c is much faster than the RLS-based method. 



APPLICATION TO BIOLOGICAL PATHWAYS 



Isoprenoids play various important roles in plants, functioning as membrane components, photosynthetic pigments, hormones 
and plant defence compounds. They are synthesized through condensation of the five-carbon intermediates isopentenyl diphos- 
phate (IPP) and dimethylallyl diphosphate (DMAPP). In higher plants, IPP and DMAPP are synthesized through two different 
routes that take place in two distinct cellular compartments. The cytosolic pathway, also called MVA (mevalonate) pathway, 
provides the precursors for sterols, ubiquinone and sesquiterpenes |17]. An alternative pathway, called MEP/DOXP (2-C- 
methyl-D-erythritol 4-phosphate / 1-deoxy-D-xylulose 5-phosphate), is located in the chloroplast and is used for the synthesis 
of isoprene, carotenoids, abscisic acid, chlorophylls and plastoquinone llSll . Although this subcellular compartmentation allows 
both pathways to operate independently, there are several evidences that they can interact in some conditions [19]. Inhibition 
of the MVA pathway in A. thaliana leads to an increase of carotenoids and chlorophylls levels, demonstrating that its decreased 
functioning can be partially compensated for by the MEP/DOXP pathway. Inversely, inhibition of the MEP/DOXP pathway 
in seedlings causes the reduction of levels in carotenoids and chlorophylls, indicating a unidirectional transport of isoprenoid 
intermediates from the chloroplast to the cytosol. In order to investigate whether the transcriptional regulation is at the basis 
of the crosstalk between the cytosolic and the plastidial pathways, Laule et al. 11911 have studied this interaction by identifying 



n 


^2C 


PINV 


RCM 


AUC 


AUC std 


T(s) 


AUC 


AUC std 


T(s) 


AUC 


AUC std 


T(s) 


r 500 


0.999 


0.0000 


0.4401 


1.000 


0.0000 


0.0152 


1.000 


0.0000 


2.76 


h 500 


1.000 


0.0000 


0.4506 


1.000 


0.0000 


0.0061 


1.000 


0.0000 


4.19 


c 500 


0.999 


0.0000 


0.4184 


1.000 


0.0000 


0.0065 


1.000 


0.0000 


3.45 


r 200 


0.996 


0.0004 


0.4206 


0.997 


0.0004 


0.0038 


0.998 


0.0004 


1.92 


h 200 


1.000 


0.0000 


0.4266 


1.000 


0.0000 


0.0030 


1.000 


0.0000 


2.26 


c 200 


0.976 


0.0023 


0.3971 


0.985 


0.0009 


0.0036 


0.978 


0.0011 


2.10 


r 20 


0.821 


0.0047 


0.4106 


0.654 


0.0097 


0.0024 


0.815 


0.0066 


1.56 


h 20 


1.000 


0.0000 


0.4174 


0.542 


0.0076 


0.0019 


0.866 


0.0081 


1.43 


c 20 


0.675 


0.0052 


0.3776 


0.574 


0.0076 


0.0022 


0.666 


0.0057 


1.48 



TABLE III: AUC, AUC standard error and timing performances for p = 50. Left part: lie method. Center part: PINV. Right part: RCM. 
Indices r, h and c stand for random, hubs and clique pattern, respectively. 

the genes with expression levels changed as a response to the inhibition. They have shown that the inhibitor mediated changes 
in metabolite levels are not reflected in changes in gene expression levels, suggesting that alterations in the flux through the 
two isoprenoid pathways are not transcriptionally regulated. In order to clarify the interaction between both pathways at the 
transcriptional level, Wille et al. f4'l have explored the structural relationship between genes on the basis of their expression 
levels under different experimental conditions. This study aims to infer the regulatory network of the genes in the isoprenoid 
pathways by incorporating the expression levels of 795 genes from other 56 metabolic pathways. Moving beyond the one-gene 
approach, the authors have found various connections between genes in the two different pathways, suggesting the existence of 
a crosstalk at the transcriptional level. 



Results from the covarlance-regularized method for A. thaliana isoprenoid pathways 



We apply the ^2C method to the publicly available data set from Ref. |4] . The selection of the graph is performed by computing 
the 95% bootstrap confidence interval of the statistics and the absence of an edge occurs when the zero is included in this interval. 
The data consist of expression measurements for 39 genes in the isoprenoid pathways and 795 in other 56 pathways assayed on 
118 Affymetrix GeneChip microarrays. We are interested in the construction of a gene network in the two isoprenoid pathways 
in order to detect the effects of genes in the other pathways. In Fig. [T] we reproduce the inferred network with 44 edges. For 
each pathway we find a module with strongly interconnected and positively correlated genes. This suggests the reliability of our 
method since genes within the same pathway are potentially jointly regulated Ii20i1 . Furthermore, we identify two strong candidate 
genes for the cross-talk between the pathways: HMGS and HDS. HMGS represents the hub of the cytosolic module, because 
it is positively correlated to five genes of the same pathway: DPPSl, MDPCl, AACT2, HMGR2 and MK. It encodes a protein 
with hydroxymethylglutaryl-CoA synthase activity that catalyzes the second step of the MVA pathway. HDS represents the hub 
of the plastidial module, because it is positively correlated to five genes of the same pathway: DXPSl, MECPS, GGPPS12, 
IPPIl and PPDS2. It encodes a chloroplast-localized hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate synthase and catalyzes 
the penultimate step of the biosynthesis of IPP and DMAPP via the MEP/DOXP pathway. The negative correlation between 
HMGS and HDS means that they respond differently to the several tested experimental conditions. This, together with the high 
connectivity of the two hubs, provides an evidence of cross-talk between genes in the plastidial and the cytosolic pathways. Other 
negative correlations between the two pathways are represented by the edges HMGR2-MECPS, MPDC2-PPDS2 and MPDC2- 
DXPS2. Interestingly, the plastidial gene IPPIl is found to be positively correlated to the module of connected genes in the 
MVA pathway (IPPIl-MK, IPP1-IPPI2). This evidence confirms the results of Ref. [6] where they guess that the enzyme IPPIl 
controls the steady-state levels of IPP and DMAPP in the plastid, when a high level of transfer of intermediates between plastid 
and cytosol takes place. Moreover, our study shows three candidate mitochondrial genes for the cross-talk (DPPS2, GGPPS5 
and UPPSl) which are in the plastidial module. Finally, it is interesting to note that the method used in Ref. |4] includes more 
cross-links between the two pathways with respect to the iic method. Although from the literature it is known the existence of 
an interaction between the two pathways, we believe that this cross-link should not be so strong, as genes of the two pathways 
belong to two different cell compartments. A possible explanation of such a difference is that Wille et al. construct a network 
based on the first-order conditional dependence that may not capture multi-gene effects on a given pair of genes. 
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FIG. 1: Biological network of the isoprenoid pathways inferred by using PLLM. Upper part: Genes of MVA pathway. Lower part: Genes of 
MEP/DOXP pathway. Grey boxes refer to mithochondrial genes; HMGS and HDS represent the hubs of the two modules. 



CONCLUSIONS 

In this paper, we present a comparative study of three different methods to infer networks of dependencies by estimates of 
partial correlation coefficients in the typical situation when n < p. In particular, we consider the Moore-Penrose pseudoinverse 
method (PINV), the residual correlation method (RCM) and a covariance-regularized method {(2c)- Firstly, we evaluate AUCs 
and timing performances on simulated datasets and we find that PINV presents some instability in AUC outcomes associated to 
the variable number variations. On the other hand, the two regularized methods show comparable performances with a sensible 
gain of time elapsing of ^2C with respect to RCM. Finally, we present the results of an application of ^2C f™" the inference 
of a gene network for isoprenoid pathways in A. thaliana. We find a negative partial correlation coefficient between HMGS 
and HDS, that are the two hubs in the two isoprenoid pathways. This means that they respond differently to the several tested 
experimental conditions and, together with the high connectivity of the two hubs, provides an evidence of cross-talk between 
genes in the plastidial and the cytosolic pathways. This evidence did not result from studies at level of single gene. Moreover, 
studies that infer this network by using only low-order partial correlation coefficients find more interactions between the two 
pathways with respect to the I2C method. A reduced number of edges between the two pathways is plausible considering the 
different cell compartmentalization of the two isoprenoid biosynthesis pathways. 

This work was supported by grants from Regione Puglia PO FESR 2007-2013 Progetto BISIMANE (Cod. n. 44). 
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